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.— . There is significant interest in being able to predict where crimes will happen, for 

£N) example to aid in the efficient tasking of police and other protective measures. We 

aim to model both the temporal and spatial dependencies often exhibited by violent 
crimes in order to make such predictions. The temporal variation of crimes typically 
-^ follows patterns familiar in time series analysis, but the spatial patterns are irregu- 

lar and do not vary smoothly across the area. Instead we find that spatially disjoint 
regions exhibit correlated crime patterns. It is this indeterminate inter-region corre- 
lation structure along with the low-count, discrete nature of counts of serious crimes 
that motivates our proposed forecasting tool. In particular, we propose to model the 
crime counts in each region using an integer-valued first order autoregressive process. 
We take a Bayesian nonparametric approach to flexibly discover a clustering of these 
region-specific time series. We then describe how to account for covariates within this 
framework. Both approaches adjust for seasonality. We demonstrate our approach 
through an analysis of weekly reported violent crimes in Washington, D.C. between 
2001-2008. Our forecasts outperform standard methods while additionally providing 
useful tools such as prediction intervals. 
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1 Introduction 

Violent crimes are a significant concern in metropolitan areas across the United States. 
The impact of violent crimes on a city are many fold, ranging from harm to residents to loss 
of tourism, and the ability to curb such crimes is of utmost importance. In particular, there 
is significant interest in being able to predict regions in which crimes are likely to occur so 



that preventive measures may be employed in both the short- and long-term. Predicting 
crimes relies on the presence of dependence over time and among regions. The structure 
of the temporal dependence we find is familiar; for example, violent crimes are known to 
vary seasonally, with the rate of violent crimes higher during warmer months of the year 



(McDowall et al., 2012). The spatial dependence is more complex. One might expect, for 
instance, neighboring regions to experience similar crime rates. Geographic features, how- 
ever, can impede crime from varying smoothly across regions. For example in Washington, 
D.C, Rock Creek Park in the northeast and the Anacostia River in the south create natural 
impediments to the spread of crime. Boundaries also occur at a finer resolution (e.g. railroad 
tracks and highways). The presence of such idiosyncratic features makes it challenging to 
devise statistical models that are able to borrow strength among similar local regions without 
over-smoothing distinct regions. 

As an example of our methodology, we consider rates of violent crimes in the 188 census 
tracts in Washington, D.C. The nation's capital has consistently ranked among the top cities 
for rates of violent crimes in the United States. The crimes in our data consist of rape, rob- 
bery, arson and aggravated assault. Along with murder, these types of crimes define the FBI 
part 1 violent crimes list. Part 1 crimes are considered serious and are directly reported to 
the police (as opposed to other law agencies such as the IRS). Indeed, the Washington, D.C. 
police department keeps a record of all reported type 1 violent crimes and makes it publicly 
available through their website (http://crimemap.dc.gov/CrimeMapSearch.aspx). 

Figure [I] (left) shows a map of Washington, D.C. with boundaries of the census tracts 
superimposed. The sizes of the census tracts vary widely depending on population density. 
A census tract consists of adjacent street blocks selected to be homogeneous with respect to 
demographic features such as economic status and living conditions. According to the 2000 
Census, tracts in Washington, D.C. average 3043 residents, ranging from 149 to 7278. Due 
to the demographic homogeneity of census tracks, it is not surprising that neighboring tracts 



often have different crime dynamics. 
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Figure 1: Left: Map of the 188 census tracts in Washington, D.C.. Right: Weekly average 
violent crime counts across the 188 census tracts. 



The combination of demographic heterogeneity and natural boundaries contributes to the 
spatially diverse crime patterns across the census tracts. Figure [I] (right) shows the weekly 
average counts of violent crimes in Washington, D.C. between 2001 and 2008. This plot 
reveals several interesting features. First, the weekly average number of violent crimes are 
(fortunately) low. Second, tracts with similar average crime counts are often quite spatially 
disjoint. This suggests that crime, as hypothesized, does not vary smoothly across the city. 

The seasonal structure of weekly crime counts is difficult to see in sequence plots of the 
data for a single tract because of the small counts. The pattern becomes evident, however, 
in the aggregate. Figure [2] displays monthly crime counts accumulated over all of the tracts. 
This plot reveals the seasonal peaks and valleys. (Supplementary Material Section |E| includes 
plots of the counts for individual tracts.) Figure[2]also shows summary statistics accumulated 
over time for the individual tracts. In particular, it shows the distribution of the total counts 
per tract over these years and a graph of the within-tract standard deviation versus the mean 
for all tracts. 

The features of the weekly crime time series, namely (i) low-count discrete data with (ii) 
an uncertain correlation structure, necessitate the development of new forecasting tools. 
In particular, we propose a novel methodology which models each regional time series as 
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Figure 2: Top Left: Histogram of weekly crime rates in Washington, D.C.'s 188 census tracts. 
Top Right: Weekly standard deviation versus average violent crime counts across the 188 
census tracts. Bottom Right: Total monthly crime counts. 



an integer-valued first-order autoregressive process (INAR(l)) (|Alzaid and Al-Osh, 1988 



McKenzie, 2000). We induce correlation between the time series through the innovations of 



the INAR(l) models. We decompose these innovation processes into two latent factors: a 
shared seasonal effect and a rate function which is tract specific. We use multiple shrinkage 
to cope with the large number of tracts. We obtain this shrinkage adaptively by adopting 
a nonparametric Bayesian approach that imposes a Dirichlet process prior on the tract 



rates. The Dirichlet process leads to a clustering of rates and thus efficient sharing of the 
information between tracts in a flexible, data-driven manner. 

We begin in Section [3] with a purely geographic approach that is aligned with police 
procedures and work solely with raw crime counts within census tracts. We then examine 
how to account for covariates, in particular population size, in Section [7| 

We develop an efficient Markov chain Monte Carlo scheme for fitting these models, and 
we use this scheme to fit our model to simulated and real data. The results demonstrate 
that our proposed multivariate INAR(l) model produces out-of-sample forecasts that are 
more accurate than those of a conditional least-squares (CLS) model in both simulations 
and when predicting crimes in Washington, D.C. One reasonable explanation as to why 
our model outperforms CLS is because the number of clusters discovered is small relative 
to the number of tracts. Our model essentially shrinks the time series estimators that share 
a rate toward a common mean, thereby yielding better out-of-sample forecasts. Another 
advantage, a byproduct of the Bayesian framework, is that the Bayesian model provides 
posterior distributions of the p-step-ahead forecasts. These distributions are important in 
the context of forecasting crime because the distribution of crime is right-skewed and decision 
makers often care about preparing for the worst-case scenario. Prediction intervals for the 
number of violent crimes in each region can help the police distinguish between an unusual 
rise in violent crimes that requires intervention and a rise which is due to random variation. 

1.1 Related Approaches 

Multivariate Poisson-based models are a natural match to spatio-temporal time series of 
counts, and this structure has been employed in various prior applications. For example, 



Boudreault and Charpentier (2011) model the occurrences of earthquakes using a maximum 
likelihood approach to infer the model parameters of a multivariate INAR(l) process with 
Poisson innovations. (This formulation does not maintain Poisson margins, as discussed 



in Section [2]) Taddy (2010) employed Poisson processes to track the intensity of violent 



crimes in Cincinnati and treats these as point processes. Taddy (2010) factors the spatial 
Poisson rate into a process density, modeled using Bayesian nonparametrics, and an overall 
intensity. Both were allowed to evolve in time. Such a formulation, however, assumes spatial 



smoothness to the crime rates. Additionally, Taddy (2010) focus on in-sample inference 
rather than on predicting future events. In contrast, our research focuses on areal data 
modeling and provides methods to forecast these as multiple integer-valued low-count time 
series. We harness the efficient and elegant structure of INAR(l) processes and present a 
method for modeling multiple, correlated time series while maintaining Poisson margins. 
The correlations are induced via a Bayesian nonparametric clustering of the time series, 
and in doing so, we efficiently share information to produce more accurate out-of-sample 
predictions. Bayesian nonparametric methods have previously been studied as tools for 



data-driven clustering analysis (cf., Dorazio et al., 2008, Fox et al., 2010, Teh et al., 2006). 



However, these studies focus on clustering either continuous-valued time series or Poisson 
counts which have no time component. 

Our paper is structured as follows. In Section [2j we provide background on the Poisson 
INAR(l). We then present our method for correlating multiple such processes while main- 
taining Poisson margins in Section |3j Here, our focus is on correlation structure captured by 
geography alone. The associated posterior computations via Markov chain Monte Carlo are 
detailed in Section |4j In Section [5j we briefly provide a simulation study and in Section [6j we 
analyze our crime data of interest. Finally, in Section [7] we describe a method for accounting 
for covariates, and in particular consider population size as a predictor. 



2 Univariate PoINAR(l) Background 



A univariate PoINAR(l) model is defined as follows (Alzaid and Al-Osh, 1988): 



Y t+1 = aoY t + e t+ i for t = 0, 1, 2, ... , (1) 

where the innovations {e t } are iid Poisson. The operator o denotes binomial thinning. For 

any nonnegative integer- valued random variable X and for any a G [0,1], the random variable 

a o X is defined 

x 

aoI = ^5 t (a), (2) 

t=i 

where -Bj(a) are independent, identically distributed Bernoulli random variables with success 
probability a. Given a Poisson distribution on the initial state Yq with finite mean \x = EYq 
and independence between Y t and e t ~ Poisson((l — a)fi), the construction Eq. (JIJ yields a 
strongly stationary process. In essence, to obtain a stationary Poisson marginal distribution 
from Eq. (fll), the innovations must also be Poisson (Alzaid and Al-Osh, 1988, Steutel and 



Van Harn, 1986, Wolpert and Brown, 2011). 



3 Multivariate PoINAR(l) 

In this section we define a multivariate INAR process that retains Poisson margins. We 
first introduce the basic model and then demonstrate how to induce correlations among the 
component series by placing a Dirichlet process prior on the rate parameters of the Poisson 
innovations. We conclude this section by highlighting the similarities and differences between 
the proposed model and the vector autoregressive process, which is the equivalent model with 
Gaussian margins. 

Throughout, let Y^ t denote the number of violent crimes at tract / = 1, . . . , L during week 
t = 1, . . . , T. Furthermore, let Y t := (Y 1>t , . . . , Y Lt ) denote a vector of crime counts at time 



t and e t := (e^t, e 2 ,t, . . . , e^ )t ) the vector of innovations. 

3.1 A multivariate PoINAR(l) process 

One might imagine employing a multivariate PoINAR(l) analogue of a vector autoregres- 
sive process by simply considering: 



■t+x 



aoYt + e 



t T tt+l 



where a is an L x i matrix with entries in [0, 1] and ex o Y t is defined as: 



[aoy] (il := >^ atij o Y [>t , 



(3) 



z=i 



with ctijoYit defined by the binomial thinning operator Eq. pi). Even in the simplest scenario 
of €i t t being independent Poisson innovations, however, the resulting margins are in general 
not Poisson. In fact, it is straightforward to prove that when the off-diagonal elements of the 
thinning matrix ex are non-zero, a stationary distribution exists but is no longer the Poisson 



distribution (See McKenzie, 2000, Pedeli and Karliss, 2011). Such a multivariate INAR(l) 



was considered in (Boudreault and Charpentier, 2011). 



The one scenario in which Poissonicity is maintained is if ex. is diagonal, i. e. oiij = for 



i 7^ j, so that the model becomes 
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For notational convenience we will denote the diagonal elements by «j := a^. The diagonal 
thinning matrix implies that at time t, the I th entry of the thinned random vector is only a 



function of only the I th tract: 

[a o Y t ] t = a t o Y ltt . (5) 

For the innovations processes, we assume e^A^ ~ Poisson(A; t ). That is, conditional on 
the rate parameters, the innovations are independently Poisson distributed across time and 
space. The resulting multivariate INAR(l) yields a stationary Poisson distribution for each 
element in Y t . We refer to this process as the multivariate PoINAR(l). We emphasize that 
the diagonal thinning matrix not only dramatically reduces the number of model parameters, 
but also produces a stationary process with Poisson margins. 

Conditioning on the rate parameters {A^} yields L independent time series. To allow such 
models to capture dependencies between the time series, we introduce a Dirichlet process 
mixture model for the innovations in the next section. 

3.2 Capturing dependencies 

There are several ways to induce dependencies between the elements of the multivariate 
PoINAR(l) process. As previously mentioned, there are two sources of variation in the 
model: the multivariate binomial thinning operator and the innovation process. We propose 
to generate the dependence through the innovations and assume that the binomial thinning 
operators are independent across the time series. For our model of crime rates, this formu- 
lation shares information between tracts while allowing tract-dependent autocorrelations. 
Furthermore, this focus on the innovations provides computational efficiencies as described 
in Section HI 

Recall that the innovations are assumed to follow a Poisson distribution with rates A/ i4 . 
The rate is a function of both the tract I and the time period t of the specific innovation e/ i4 . 



We decompose the rate A; t into a product of a spatial and temporal components: 

Ai,t = \iO s (t), (6) 

where the summands are 

• a tract-specific rate, A;, and 

• a seasonal monthly rate, 6 s (t), that is spatially homogeneous. Here, s(t) is a function 
that maps week t to its associated month. That is, we assume a constant seasonal 
effect within months and model this effect with 12 parameters 0i, ... ,612- 

The resulting model for the innovations can be written as follows: 

e M ~ Poisson(Ai6 l s(t) ). (7) 

The temporal component induces some dependence across tracts because it is shared across 
the different time series. A Dirichlet process (DP) prior on the rates, A;, provides the balance 
of the dependence. 

The Dirichlet process, denoted DP(t, Go), provides a distribution over countably infinite 
probability measures. Go denotes a probability distribution over some space Q; in our 
application, Q is the positive real line. The concentration parameter r > controls the 
'clumpyness' of the process. A Dirichlet process can be written as a weighted sum of point 
masses S x positioned by randomly sampling Gq: 



G = ^A^ fe , fc ~G o . (8) 

fc=i 

The weights (3t can be obtained sequentially via the so-called stick-breaking construction 
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(Sethuraman, 1994): 



fc-i 



Pk = v k JJ(1 " n) Vk ~ Beta(l, r). (9) 

1=1 

In effect, the process divides the unit interval into segments with lengths given by the de- 
creasing sequence of weights /?&: the k th weight is a random proportion Uf. of the segment 
that remains after the first k — 1 weights have been chosen. We denote this distribution by 

/3 ~ GEM(r). 



The DP has proven useful in many applications due to its clustering properties (cf., Teh 



et al., 2006). The predictive distribution of draws W ~ G shows why the DP produces 
clusters. Because probability measures drawn from a DP are discrete by construction, there 
is a strictly positive probability of multiple observations of W taking identical values within 
the set {4>k}, with </>£ defined as in Eq. ([8]). For each sampled observation Wi, let Zi identify the 
position of the corresponding parameter cf) k such that Wi = 4> Zi . The predictive distribution 
on the membership variables can be written in the following manner: 



K + l w.p. j^- 
Z n+ i\(z 1 ,...,z n ,t) = { T (10) 



k w.p. jtt*- for k = l,...,K, 



where n k indicates the number of members belonging to the k th group and K = max{zi, . . . , z^} 
identifies the number of distinct values observed through the first iV samples. The distri- 



bution on partitions induced by the sequence of conditional distributions in Eq. (10) is 



commonly referred to as the Chinese restaurant process (CRP). The CRP provides an al- 



ternative representation to the DP (Pitman, 2006). This representation emphasizes the 



reinforcement property of the DP that leads to its clustering properties. It can be shown 
that the expected number of clusters using the CRP grows as 0{r log(L)) where L is the 



number of observations (see Teh, 2011, for a detailed proof.). This implies that the average 
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Figure 3: Graphical model of the multivariate dependent PoINAR(l) model. Le/t; Innova- 



tions generating from a Dirichlet process mixture model as in Eq. (11 ). Right: An equivalent 



representation using cluster indicator variables z\, . . . , zl as in Eq. (12). 



number of clusters is much smaller than the number of observations. 

In our model for crime rates, we impose a DP prior on the L tract-specific rates, A/. Note 
that in our application, the number of observations from the DP is equal to the number of 
tracts, rather than the number of time points. The DP prior thus groups the time series 
according to their corresponding tract-specific rates into a few clusters. Members of the A; th 
cluster share a common tract rate, 0*.. The grouping of the time series into a small number of 
clusters provides useful shrinkage that pools information across the cluster, thereby yielding 
more accurate out-of-sample predictions for the multiple time series. When we combine the 
tract-specific rates and the seasonal effects, we obtain the following generating process for 
the innovations: 



Q >t ~ Poisson(A; 9, 



a{t)j 



1 = 1,. 



.,L t=l,...,T 



0™~F m = l 12 



A,~G 1 = 1 L 



'ir 



G~DP(r,G ). 



For our application, we choose F to be a gamma distribution in Eq. (21 ) 
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Figure M\ (left) shows a graphical representation of our dependent multivariate PoINAR(l) 
process. For details on interpreting such graphical representations, the reader is referred 
to 



Jordan (2004). Alternatively, we can use an equivalent representation using the GEM 



distribution and the membership labels z 1 , . . . , zl (see Figure [3] (right)): 



€i >t ~ Poisson(<k, 9 8 (t)) l = l,...,L t = l,...,T 
6 m ~F m = l,...,12 



(12) 



Zi ~ Multinomial(/3) I = 1, . . . , L 
fa ~G ~ GEM(r) j = 1,2, 

3.3 Prior specification 

The multivariate PoINAR(l) requires estimation of three main components: 

• thinning values (a%, . . . , ocl), one for each time series. 

• monthly seasonal effects, (6*i, . . . , #i 2 ). 

• rates for each tract, [Ai, . . . , AJ. 

The Bayesian framework places prior distributions on each of these three elements. Our 
priors are both computationally convenient and weakly-informative. For the thinning values 
and monthly seasonal effects we specify: 



ai ~ Beta(r/i, 772) for I = 1, . . . , L (13) 

9 m ~ Gamma((i, £2) for ra = 1, . . . , 12. 



We also explored the half-normal distribution as a prior for the seasonal effect, and we found 
the choice did not reveal any material changes from the results presented in Section [6] The 
DP requires the specification of the base measure, Go, and the concentration parameter, r. 
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We choose the base measure to be the Gamma(7i,72) distribution, which is well suited to 
our model because not only it is conjugate to the Poisson distribution but it also provides 
a natural interpretation. In particular, we have a prior belief that weekly rates of violent 
crime are typically small, but a few tracts have higher rates. Hence, a gamma distribution 
with shape and scale parameters 71 = 1 and 72 = 0.1 reflects these prior beliefs. For the 



concentration parameter, we specify r ~ Gamma(a r , b r ), as suggested by Escobar and West 



(1994). 



3.4 Relationship to Multivariate Independent AR(1) process 

The continuous counterpart to the multivariate PoINAR(l) is the Gaussian multivariate 
first-order autoregressive process (VAR(l)). This process is composed from L independent 
AR(1) processes and can be formulated in the following manner: 

Y t+1 = a ■ Y t + e t +i t = 1, . . . , T 

e t |S~iV(0,S) (14) 

Ylo\^i,o, crifi ~ N(/Ji fi , ai fi ) I = 1, • • • , L. 

where [£]i,j = for i 7^ j and [£]i,i = erf. Compare this specification to the multivariate 
PoINAR(l): 

Y t+1 =aoY t + e t +i t = 1, . . . , T 

e t |A t ~ [Poisson(Ai j4 ), Poisson(A 2 , t ), • • • , Poisson(A Lj4 )] (15) 

Y lj0 \Ai j0 ~ Poisson(A ii0 ) I = 1, . . . , L. 

These two models not only share similar notation, but also possess two common character- 
istics: 

• The distributions of the innovations match the marginal distributions of Y^ t - The 
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PoINAR(l) with diagonal ex (having non-negative diagonal entries) has Poisson marginal 
distributions while the VAR(l) has Gaussian marginals for any ex. If the parameters 
defining the processes are chosen carefully, these models can be shown to have a Poisson 
or Gaussian stationary distribution, respectively 

• The autoregressive parameter a^i determines the autocorrelation coefficient in both 
models, corr(Y^+i, F M ) = a M . 

These similarities to the continuous VAR(l) make the discrete PoINAR(l) especially at- 
tractive and easy to interpret. The VAR(l) process, however, has a single source of variation 
- the innovations process - while the PoINAR(l) process has two - the binomial thinning and 
innovations processes. This key difference complicates inference for the PoINAR(l) model, 
which we address in Section HJ 

4 The MCMC Sampler 

As previously noted, the PoINAR(l) model is a combination of two underlying processes: 
the binomial thinning process and the innovations process. Each of these processes has its 
own parameters: binomial thinning uses the thinning parameters {a{\ and the innovations 
process uses the rates {<pk} and the seasonal effects {6 m }. For posterior computations within 
our Bayesian framework, we employ an MCMC sampler. Intuitively, the idea is to sample 
a posterior latent innovations sequence and then condition on this sequence to sample both 
the latent DP clustering of census tracts and also the thinning parameters and seasonal 
effects. In contrast, in the corresponding VAR(l) model there is no need to sample the 
innovations sequence since they are a deterministic function of the observations and the 
model parameters. Therefore, one would expect the multivariate PoINAR(l) model to be 
computationally cumbersome compared to its VAR(l) counterpart. However, our proposed 
sampler harnesses computational advantages from small observed counts in our crime data 
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and sufficient statistics implied by the Poisson model. We outline the resulting sampler 
below. For detailed derivations, see Section [A] of the Supplementary Material. 

1. Sample the innovations, e := [e 1; . . . ,€l] where ei := [e^i, . . . , 6/,t] is the innovations 
series for the I th tract. The posterior factors as: 

L T 

P(e\Y, a, A, 0) = J] J] P(e, )t |^, t _i, Yi, t , a u \ h 6). (16) 

1=1 t=2 

align Given the observations Y and the parameters of the multivariate PoINAR pro- 
cess, the innovations can be sampled independently for each tract and each time point. 
The possible values satisfy max{0, Y it — 3^ : t_i} < £/,t < Y [)t with corresponding proba- 
bilities 



P{ei,t\Yi, t -i,Yi it ,ai,\i,0) oc 



ei, t l(Xi,t - ei,t)K Y i,t-i ~ (Yi,t ~ ei, t ))\ \ a t 



(17) 



Although this expression does not define a well-known discrete distribution, it is ana- 
lytically tractable because of the small counts in the data (i.e., max{0, Y^ t — Yi t _{\ < 
ti.t < Yi,t an d Yi,t is assumed to be small). In the crime data for Washington, D.C, 
max Yi j = 11. Another important consideration that reduces the computational bur- 
den is that certain e^ t values can be deterministically set from the observations vector 
Y t : if yij = then e^t = and if yij-i = then e/,t = yi,t- Since our crime data has 
many zero counts, these constraints substantially lower the computational cost of this 
portion of the sampling. If larger counts are observed, then one can use a Metropolis- 
Hastings step to sample from this distribution with a Poisson proposal distribution. 
Importantly, note that if the observed counts are large enough one can apply a sta- 
bilizing transformation such as the square root and model the resulting process as a 
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VAR. This strategy has been shown by Brown et al. (2005 ) to yield satisfactory results 



in the univariate case when A/ is assumed to exceed (roughly) 5. 

2. Sample the membership indicator vector, z := [z±, . . . , z{\. We harness the DP-induced 
Chinese restaurant process (CRP) and iteratively sample tract-specific cluster indica- 
tors: 



( Tpi t0 for k = K + 1 
P(zi = k\z/i,e,G, 71,72, t) oc 

n k pi tk for k = 1,..., K, 



where K + 1 identifies a previously unseen cluster, = ^*=i ^s(t) an d Z/i is the 
vector of membership indicators, not including the I th term. The first terms, (r, rij), of 



Eq. (18) arise from the CRP prior of Eq. (10) and the exchangeability of the process 



such that each z\ can be treated as the last. The second terms, (pifl,Pij), correspond 
to the likelihood of the innovations e given the cluster assignments (zi = k, zn) and 
seasonal effects G, marginalizing the cluster- specific rates </>&. The terms are given by 
the following negative binomial distributions: 

Plfl ~ ~n^W UtW UtW (19) 



Pi, 



T{S l + A J + ll ) ( A e \ A > +J1 ( e ^ s ' 



r(A,- + 7i)$! V ^© + 72/ V% e + 72 



where Si = J2t=i e M an d A? = J2i- Zi =j i^i &i- Note that pi t0 and pij only rely on sums of 
the innovations and the sum of seasonal effects. We also highlight that the conditional 



conjugacy of our formulation allows us to use the collapsed sampler of Eq. (18) for the 
Zi, marginalizing {<pk}- 

3. Sample unique rates, <pk- Although the rates collapse away in sampling the cluster 
indicators, Zi, they are needed for sampling the innovations sequence (Step 1) and 
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seasonal effects (Step 3). As such, we instantiate the unique rates as auxiliary variables 
for these steps, and then discard them. For each currently instantiated cluster, sample 

<pk as: 

</>fc|e,z, 0,71,72 ~ Gamma(5 fc + jx,n k 6 + 72), (20) 

where B^ = J2ie{ v -z v =k} &i- Again, we only rely on the sum of the innovations, Si, to 
compute the posterior distribution. 

4. Sample the seasonal effects vector, [9\, . . . , #12]. The m th element of this vector can be 
sampled as: 

/ L L \ 

Q m \e } </>,&, & ~ Gamma I Jj Yl e ^ + ^ 1,9m zJ Xl + & ' ( 21 ) 

W=l t:s(t)=m 1=1 J 

where q m counts the number of occurrences of the m th month in the data. Notice that 
for this step we sum the innovations over tracts rather than time. 

5. Sample the vector of thinning parameters, [«i, . . . , aj. For tract I, 

T T 

a,|e,, Y h rn, m ~ Beta(^ Y l)t - 5, + rji, ^0W - *w) + ^ + *fc)> (22) 

t=2 t=2 

where Si is defined as in Step 2. 

6. Sample the concentration parameter, r, for the Dirichlet process prior according to 



Escobar and West (1994). 



It is important to note that if the model did not include seasonal effects, then one could 
simply sample the sum of the innovations, Si, instead of the vector of innovations, e/. This 
would reduce the computational cost of the sampler since Step 1 is the most time consuming. 
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5 Simulation Examples 

In order to demonstrate the performance of our model, we simulate datasets from 9 different 
multivariate PoINAR(l) processes of Eq. Q. Each dataset has L = 100 time series (tracts) 
with T = 208 observations which correspond to 4 years of weekly data samples. We group 
the multiple time series into four equally sized clusters that each share a common rate. The 
data sets vary in the choice of: 

• The separation between the cluster rates, <fik- We examine an "easy" setting in which 
the four cluster rates are well separated at 1,3,6,10, a "medium setting" with less 
distinct rates 0.01,0.5, 1.2,2, and a "hard" setting with rates 0.1,0.2,0.3,0.6. 

• The thinning values, ai, which determine the autocorrelation of the individual PoINAR(l) 
processes. The examples use a common choice for a\ for all tracts in a dataset, choosing 
from a, = 0.1,0.5,0.9. 

We evaluate the root mean square error (RMSE) and absolute percentage error (APE) of 
our MCMC sampler both in- and out-of-sample. These metrics measure the distance between 
the true population expected value and its corresponding estimate based on the observed 
L = 100 time series. The simulation results show that our model produces accurate out-of- 
sample forecasts under various configurations. Table [T] presents the RMSE of our Bayesian 
nonparametric model compared to the RMSE of a simple Poisson process model (SPP) and 
the conditional least-squares model (CLS). (The Supplementary Material, Section [Bj details 
both of these.) The results in Table II] show that our model outperforms these alternatives. 
As expected, the larger the separation between the cluster rates, the easier it is for our 
method to identify the true clusters and yield better estimates for their parameters. Also, 
higher autocorrelation helps our method produce more accurate estimators. 

These simulation results indicate that the sampler finds clusters when they exist. It is 
also important to demonstrate that the model does not spuriously spawn clusters when the 
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Thin=0.1 


Thin=0.5 


Thin=0.9 


Rates 


Easy 


Med 


Hard 


Easy 


Med 


Hard 


Easy 


Med 


Hard 


SPP RMSE 


0.477 


0.113 


0.005 


1.674 


0.880 


0.293 


6.128 


1.155 


0.552 


CLS RMSE 


0.306 


0.080 


0.035 


0.284 


0.114 


0.057 


0.343 


0.118 


0.055 


BNP RMSE 


0.219 


0.058 


0.026 


0.260 


0.086 


0.045 


0.299 


0.075 


0.043 


e(x.,t+i) 


5.383 


1.001 


0.317 


9.861 


1.848 


0.591 


52.161 


9.908 


3.0633 



Table 1: RMSE of estimates of the conditional mean obtained by the CLS, SPP and our 
Bayesian nonparametric method. The last row shows the expected (true) conditional ex- 
pected value. 

data are homogeneous. As part of our simulations, we also examined the performance of 
these methods in the situation in which a single process (cluster) generates all of the time 
series. The findings are presented in Section [C] of the Supplementary Material, which also 
contains a more detailed description of the simulations and results presented in this section. 
As one would hope, under these conditions we identify a single cluster, further validating 
our methodology. 

6 Violent Crime Data Analysis 

In this section we examine both in- and out-of-sample results using the reported counts of 
violent crimes in Washington, D.C as described in Section [T} The data consist of L = 188 
time series (census tracts) with T = 418 weeks of counts in 2001 through 2008. We use the 
first 7 years of data to train our model and the last 52 weeks to evaluate its out-of-sample 
forecasts. We ran 5 MCMC chains for 5,000 iterations from different initial values, each 
drawn from the following priors: 



8 m ~ Gamma(l, 1) a\ ~ Beta(l, 1) 
t ~ Gamma(2,4) fa ~ Gamma(l, 1). 



(23) 



We performed a sensitivity analysis for the hyperparameters during the simulation stage, 
but found no significant changes to the results. We discard the first 1000 iterations as 
burn-in and then thin the remaining 4,000 samples by 50. Therefore, our inference for each 
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Figure 4: Histograms of the posterior number of clusters for the multivariate dependent 



PoINAR(l) described in Section 3.2 (left) and the population adjusted multivariate depen- 
dent PoINAR(l) model described in Section [7] (right). 

parameter of the model is based on the resulting 80 • 5 = 400 MCMC samples. We use the 



scale reduction factor (recommended by Gelman and Rubin, 1992) to monitor convergence 
across the chains. 

We begin by looking at the distribution of the number of clusters over the 400 iterations 
in the left panel of Figure |4j The mode is 17 clusters, which is a substantial reduction 
from the original L = 188 time series. Figure [5] presents a representative cluster assignment 
along with the posterior rates for this assignment. This cluster assignment is selected as the 
assignment that has the minimum average Hamming distance across the different iterations 
(see 



Fox et al., 2011, for further details.). An interesting phenomenon is that census tracts 



assigned to the same cluster are frequently spatially separated. 

We further examine the posterior means for the rates, A;, of the 188 census tracts and 
their corresponding thinning values, ati, across the MCMC samples. Figure M (left) maps the 
posterior mean rates for the census tracts in Washington, D.C.. We can see certain regions 
that exhibit higher rates (e.g., tracts that correspond to a southern portion of the city, a 
central portion along 16th Street, and an east-central portion along Rhode Island Avenue.) 
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Figure 5: The minimum average Hamming distance cluster assignment along with the cor- 
responding posterior rate values. 

The results of Figure [6] are also substantiated by Figure [l] (right). Figure [7] compares, for 
each census tract, the sample autocorrelation in the counts for each tract with the posterior 
mean thinning values. The sample autocorrelation is calculated using the classical first order 
autocorrelation estimator for each time series separately (without adjusting for seasonality). 
As previously explained, the thinning values in our model determine the autocorrelation for 
each INAR(l) time series. The comparison shows that the raw data autocorrelations vary 
on a wider range of values than their corresponding posterior mean values and some of these 
raw autocorrelations are slightly negative. There are two reasons that can account for the 
differences between the two: 

1. Our model only allows the thinning value to range between [0, 1] and therefore cannot 
account for negative autocorrelation. We believe that the (small) negative raw auto- 
correlations are probably due to noise variation and therefore we are less concerned 
about this phenomenon. The standard error of an estimated first-order autocorrela- 
tion for white noise is approximately 1/y/T = l/v418 ~ 0.05; hence the bulk of raw 
autocorrelations are within about 2 standard errors of zero. 
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Figure 6: Map of posterior mean rates, A;, sampled from the multivariate dependent 



PoINAR(l) model described in Section 3.2 (left) and the population adjusted multivariate 



dependent PoINAR(l) model described in Section [7j (right). 

2. The posterior mean thinning values are adjusted for seasonal effects, whereas the raw 
autocorrelations are not. The values would be smaller in magnitude after adjusting for 
the seasonality, as our results suggest. 

For the out-of-sample evaluation we compare our MCMC method to the CLS and SPP. 
For both the CLS and our method, we use the estimated one-week-ahead conditional mean 
as the predictor for the crime counts in each tract: 



Vi,t+i = oil yi,T + A/ S (t+i)- 



(24) 



For CLS, we simply plug-in the estimates of ai, \i and 9 m for each tract. For our method, 



we compute an MCMC-based estimate by evaluating Eq. (24) for each of the 400 MCMC 



iterations and use the average of these as the final predicted value (see Section C.2 of the 



Supplementary Material for further details). For the SPP, we average the past values as the 
predictor. 
We predict the one-week-ahead number of crimes in each tract for the first week of each 
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Figure 7: Histogram of raw data autocorrelations (left) and posterior mean autocorrelations 

a h (right). 

month during 2008. Table [2] shows the one- week-ahead predicted mean RMSE and corre- 
sponding standard errors, conditional on the last observed value. The results indicate that 
when the last observed count is one of the most frequent values (0,1,2), our method pro- 
duces lower RMSE. For the less frequent, higher counts (3,4), the performances of all of the 
methods are (statistically) equivalent. This behavior is to be expected since our method 
shrinks the estimators toward the mean and therefore should perform better for lower, more 
frequent counts and worse in the more rare cases of high counts. A summary of the average 
one-week-ahead bias is presented in Section [F] of the Supplementary Material. In general, 
our method produces the smallest bias, but the differences between the methods are not 
significant except when the last observed count is zero. 

The one-step-ahead conditional mean value is the best linear unbiased estimator under 
quadratic loss. Since the CLS method minimizes the observed squared error, it is only 



natural to evaluate all three methods using the same loss function. Alternatively, Berk 



(2008 ) proposed a quantile loss function that reflects the sensitivity of the police department 
to forecasting errors. Under a -u-quantile loss function, the predictor is just the predictive 
distribution's v quantile. Using our method, one can easily sample from the following one- 
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Figure 8: The predictive posterior distribution for each of the 188 tracts. The red dots 
correspond to the median predicted number of violent crimes for each tract. The blue trian- 
gles and green crosses correspond to the 95% and 99% percentiles of the predictive posterior 
distribution, respectively. The black stars corresponds to the test-set actual observed value 
of crimes. 

step-ahead predictive posterior distribution and evaluate any desired quantile: 



paw = tt,min« = w, °t\ e {m \ A, (m) ) 



r=0 



J2 ( W,t ~ l ) (<4 m) ) Vl ' t ~ r (1 - al m) )^*-i-<*,*-r) 



w — ^+1)^^ (25) 



where A 



(m) 



(m)? Q^ an d oq are the rate, seasonal component and thinning value 



estimated during the m th iteration of the MCMC sampler. Figure [8] shows the 95% and 
99% quantiles for each of the 188 tracts and the corresponding one-step-ahead true value, 
yi,T+i- The quantiles may also be used to provide prediction intervals for each tract. A police 
department can use these intervals along with the point estimate to distinguish between an 
unusual surge in crimes which requires allocation of more resources, and a random rise in 
crimes, which would not benefit from an intervention. 
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1 


2 


3 


4 


Overall 


SPP RMSE 


0.8373 
(0.034) 


0.966 
(0.0311) 


1.1829 
(0.0453) 


1.4722 
(0.088) 


1.4252 
(0.1631) 


0.970 
(0.0167) 


CLS RMSE 


0.7729 
(0.0245) 


0.9501 
(0.0430) 


1.0605 
(0.0660) 


1.1370 
(0.0982) 


1.3258 
(0.1991) 


0.9235 
(0.0368) 


Dependent PoINAR RMSE 


0.7222 

(0.0135) 


0.9172 

(0.0172) 


1.009 

(0.4336) 


1.0225 

(0.0862) 


1.1600 

(0.1782) 


0.72168 

(0.0016) 


Frequency 


0.5900 


0.2340 


0.1160 


0.0400 


0.0200 


1 



Table 2: One-step-ahead average RMSE as a function of the last observed value of y.^- We 
also provide the standard errors associated with the average RMSE. 



7 Covariates Adjusted Dependent PoINAR(l) 

Previous research has shown that the tracts of crime are associated with demographic co- 
variates, and we have several ways to incorporate such features into our Bayesian model. For 



example, Blei and Frazier (2010) incorporate covariates directly into the clustering mech- 



anism. This approach might improve the accuracy of forecasts, but it would provide less 
in the way of interpretation, such as how the various covariates associate with crime rates. 
Instead, we take a more direct approach that offers the advantages of clustering as well as 
interpretation. We model the tract-specific rate A/ as a linear function of covariates and clus- 
ter the coefficients of the equation. The clusters of coefficients may provide further insight 
into the relationships between crime and demographic characteristics. 

7.1 Adjusting for population 

The main goal of this section is to demonstrate how to add covariates to our model and 
to explore the benefits of doing so. To this end, we look at the population sizes in each of 
the census tracts as a possible explanatory variable. 

Let Xi denote the population of the I th census tract. (We obtain the populations from 
the 2000 census. Section [D] of the Supplementary Material shows a map of the population 
density in Washington, D.C.) To incorporate population into our model, we redefine the 
tract-specific rate as a linear function of population, i.e. A; = Xitpi, where ipi is the number 
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of violent crimes per person in the I th tract. We then place a DP prior directly on the rate 
per person parameter, ipi, yielding the following model: 

e M ~ Poisson(Xiipi9 s( t)) l = l,...,L t = l,...,T 

6 m ~F(uj) m = l,...,12 

(26) 

^~G 1 = 1,. ..,L 

G~DP(r,G ). 

It is straightforward to adjust the MCMC sampler described in Section [4] to incorporate 
the population covariate, Xi. We change the base measure Go to Gamma(0.5,0.5) to reflect 
the adjustment for population sizes while remaining weakly informative. After these simple 
modifications, we run the sampler in the manner previously described in Section |4| 

7.2 Analysis of results 



Using the covariate-adjusted PoINAR(l) of Section 7.1, we again analyze the counts of 
violent crimes in Washington. As in Section [6j we begin by showing the posterior distribution 
of the number of clusters over the 400 MCMC iterations (again taken from 5 chains, each 
run for 5,000 iterations). Figure [4] (right) indicates that the distribution is much narrower 
when we adjust for the population density, and has a mode of 14 clusters. This suggests 
that population can account for a significant amount of the spatial heterogeneity in crime. 
Figure [9] maps the posterior means of the crime rate per person, ipi. This map highlights 
three main features: 

1. The center of Washington, D.C. has a high count of violent crimes per person. 

2. The northwest portion of the city has very few crimes per person. 

3. The city has three hot-spots: in the center of the city and in the eastern and south- 
western portions of the city. 
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Figure 9: Map of the posterior mean values for crime rates per person, ip^. 



These insights, also highlighted by Cahill and Roman (2010), differ from the conclusions one 



would make by simply looking at the mean values as shown in Figure [T] or from our previous 
analysis. The results emphasize tracts which exhibit high crime rates per person as opposed 
to high crime counts and can help police make future planning decisions, such as where to 
place a new station. These outcomes, important as they may be, are merely a byproduct of 
our estimation method. The more interesting research question is whether the population 
covariate can improve the prediction abilities when compared to the unadjusted model. 
We performed a one-week-ahead forecast for the last week in the series using both the 



unadjusted model of Section |3.2| and the adjusted model accounting for population. The 
overall RMSE of the unadjusted model was 0.9663 whereas the adjusted model was 0.9713. 
These results suggest that adding the population of the tract to the model does not neces- 
sarily improve predictive accuracy. However, a more extensive analysis would be needed to 
confidently settle such issues. Although adding the population of the tract to the model may 
not improve predictive accuracy, adding the covariate seems to provide a useful benefit in 
that the revised model reveals a more interpretable grouping of the time series. Of course, 
there are many other covariates that one could consider, for example measures of poverty, 
housing characteristics, etc. 



8 Discussion 

In this paper we have presented a method of forecasting multiple correlated low-count 
time series building on the univariate PoINAR framework. The model induces correlation 
between the different time series through two sources: an overall temporal seasonal effect and 
a clustering on individual rate parameters. The latter clustering is induced by a Dirichlet 
process, which encourages sparse representations in terms of a small number of clusters. The 
grouping of the different rates allows our model to borrow strength across the different time 
series, shrinking the estimators to provide better out-of-sample forecasts. 

Our model assumes that there is some underlying clustering assignment of the multiple 
time series. Moreover, once these clusters are identified, they remain fixed throughout time. 
One can relax this assumption and allow for temporally evolving cluster assignments. There 
are a few ways to create such a mechanism, for example we might impose dependent Dirichlet 



process priors, such as those examined by Taddy (2010). 



Finally, although our focus here is on counts of violent crimes, this model is broadly 
applicable to many low-count spatio-temporal data sets, including the number of insurance 



claims across the U.S., earthquakes across the globe (Boudreault and Charpentier, 2011) 



wildfires across counties (Xu, 2011), and so forth 
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SUPPLEMENTARY MATERIAL: 

Spatio-temporal Low Count Processes with Application to Violent Crime 

Events 

This supplementary material provides further details on our MCMC sampler in Section [X] 
and the baseline models to which we compare in Section [Bj In Section [Cj we elaborate upon 
the simulation studies outlined in the main paper. Finally, in Sections [D]- [F], we provide 
additional details on our Washington, D.C. crime data and results. 

A MCMC sampler derivation 

In this section, we detail the derivation of the sampling steps presented in Section [4] of the 
main paper. We first introduce the following notation: 

•Si — J2t=i e i,t 

• S = [Si, . . . , Sn] 

• S-i = [Si, . . . , Si-i, Si+i, . . . , Sn] 

• %-i — [^1, Z2, ■ ■ ■ , £j_i, z i+ i, . . . , Zn] 

• © = E*=i 0*(t) 

Step 1 - Sampling the innovations vectors In Stepfllof the MCMC sampler (Section 
111) we motivate the sampling of the innovation vector. We have shown in Eq. ( Jl6| ) that 
conditional on the data and other model parameters, the innovations are independent of 
each other and we only need to sample ej i4 when both the current observed value, yi^, and 
the previous observed value, j/i,t-i, have positive values. The values of e^ in this case will 
range between max{0, y^t — yi,t-i} < U,t < Vi,t- Otherwise, e i)t is set deterministically The 



distribution of a single innovation, 6j jt is a : 

P(ei,t\Vi,t-i, Vi,t, di, X'i, 0) oc P{y itt \e it t, Vi,t-i, oi.i)P{c i:t \\\, 6) 



OC 



y*- 1 \ a [y^-^\i g^vi.t-i-ivi.t-u.t)) e H s{t) • ( A ^ • 9 »®Y 

Vi,t - ti,t) l " Q,t! 

A z .6> s ( t )(l - «i) 



ei,t!(?/i,t -e J,t) ! (yi,t-l _ (&,* _e i,t)) ! V a i 

1 1 /A 2i 6» s(t) (l -ai) 



U,t 



Q e itt \(yi it - ej jt )!(yj jt _i - (y ijt - e i:t ))\ \ on 

We can calculate the normalization constant C- h by summing over all the possible values of 
€ij for a given set of values of y^ t and y^t-i- 

Step 2 - Sampling the membership indicator In the following equations we con- 
struct the posterior distribution for the membership indicator variable: 

P(zi = j\z-i,S,0) oc P(Si\Si I e {v : z v = j,v ^i},0) ■ P(zi= j\zi I G {v : z v = j,v ^ i}). 

It is straightforward to show that the above distribution has the following form: 

a ■ Pi o for k = K + 1 
P{z i = k\z- i ,S,0)<x ' 

n k -p i)k for k = 1,. .. ,K, 
where Pi,o,Pi,i, ■ ■ ■ ,P%,k take on values from the following negative binomial distributions: 



Pifi = 



r( 7 i)^! \Q + i2j V© + 72 



Si 



r(Sj + A j + 7 i) L 



Pi ' J r(A j + 7l )^! V 1 nj ■ e + 7 2 J U • © + 72 J ; " ' /v ' 



A J+7i / o \ £ 



e \ W1 / e 



and rij = J2i=i^z t =j an d Aj = J2i- Z =jim^i- This distribution corresponds to the posterior 
distribution shown in Step |2] of the MCMC sampler (Section |2J. 

Step 3 - Sampling the unique rate vector Since we use a gamma distribution as 
the DP base measure, the resulting conditional posterior distribution for the unique cluster- 
specific rates is as follows: 

P(<f>k\z, S, 0,71,72) oc P(Si, I e {v : z v = k}\(p k , 0,71,72) ■ P((j)k\ji, 72) 

OC ( h Bk+ ^ 1 ~ 1 e -<Pk-(n k -e+j 2 ) 

This has the form of a gamma distribution with parameters B^ + 71 and n& • + 72 where 
E>k = J2ie{vz v =k\ &i "which is the distribution described in Step [3J of the MCMC sampler 
(Section g. 

Step 4 ~ Sampling the seasonal effect Let R t = X^t=i e i,*> then the conditional 
posterior distribution for the seasonal effect is: 

P(^|A,z,e,a,6) oc P(e t , I e {t : s(t) = j}|A,£i,6) • P(^!,6) 

K ^E t:s(t)=J i? *+^- 1 e -e i -(m J -i:f =1 a; +6) 



As previously described in Step E] of the MCMC sampler (Section [4]), this is a gamma 



distribution with parameters Yut- S (t)=j -^* + £1 anc ^ m j ' Xw=i \ + £2 where m^ = |t : s(t) = j 



Step 5 - Sampling the thinning value The prior distribution for the thinning value 
ai is a beta distribution, resulting in the conditional posterior distribution: 

P(<Xi\y, 6, T]!, 772) OC «2L(Kr«,t)+'/rl . ^ _ a )E^ a (w,t-i-(l«,i-ei,*))-H»-l 

As previously described in Step[5]of the MCMC sampler (Section]!]), this is a beta distribution 
with parameters J2t=2 Vi,t ~ S i + Vi and J2t=2(Vi,t-i ~ Vi,t) + -% + ??2- 

S'iep 6 - Sampling the concentration parametr 



We follow Escobar and West (1994) and use a gamma distribution prior for the concentra- 
tion parameter, r. This stage requires first sampling an auxiliary variable k which is then 
used to sample r: 

1. Sample n ~ Beta(r + 1, L). 

2. Sample r as the following mixture of two gammas: 

t\k, K ~ 7rGamma(a T + K, b T — log(K)) 

+ (1 — 7r)Gamma(a T + K — 1, b T — log(K)), 

with weight 7r defined by 7r/(l — 71) = (a T + K — l)/(L ■ [b T — log(K)}) where K is the 
number of unique clusters. 

B Conditional least squares model 

The PoINAR(l) model can be described in the following manner: 

y t+1 = aoy t + e t+ i t = 1, . . . , T - 1 (27) 

e t ~ Poiss(A • 9 s{t) ) (28) 
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The one-step-ahead conditional expected value for y t+1 is: 

y t+1 = a ■ y t + X • 9 s (jt) (29) 

The conditional least squares methods estimates this model's parameters by solving the 
following equation 

T T 12 

^ T n a y^iVt-yt) 2 = ^ min y2(y t -a-y t + X-9 s{t) ) 2 s.t.V# s = l. (30) 

t=2 i=2 s=l 

This is a nonlinear convex optimization problem. The Lagrangian method yields the follow- 
ing conditions: 



a = «^-).^^i (31) 

v^T o V^ r „,2 



Ef= 2 y* - vt-i _ x Ef= 2 Qs(t)yt- i 

2^/t=2Vt-l l^t=2Vt-\ 

(EL vt-i) ■ (Ef= 2 yt ■ 0s(t)) - (EL yt ■ y*-i) (EL yt-Aw, 

A = jj (32) 

(ELi/ii) • (EL^)) - (EL% 2 -i • *. w ) 

2 • A V. m _ . (y t — a ■ y t -i) — C 

e t = — ^m*)^ y^l — i = i,...,i2 (33) 

2A 2 • n,- 



2- A 



C ~ v 12 -i- (^ 

Z-a=l n , \j=l 



Et: S (i)=i (y* - « • yt-i) 



71; 



(34) 



Starting from a set of initial values, we can iterate between these equations and converge to 
a solution. The convergence is met within a few cycles. 

C Simulation study 

To assess the performance of our model, we simulate 9 different datasets from our multi- 
variate PoINAR(l) process. Each dataset has L = 100 time series (locations) with T = 208 
observations. The multiple time series are grouped into four equally sized clusters defined by 
a shared rate value, <pk- The different data sets vary in the levels of separation between the 



cluster rates and the time series autocorrelation values, a.\. In this section, we evaluate the 
performance of our methods both in- and out-of-sample. The results show that our model 
can reasonably recover the ground-truth clusterings and also produce accurate out-of-sample 
forecasts under various settings. Our model also outperforms the conditional least-squares 
model (CLS), which is detailed in Section [B] One reasonable explanation for these results is 
that the CLS model does not allow for sharing of information between the time series and 
therefore is more prone to noise variation. 

C.l Simulations settings 

We have two main factors that we configure in each of the simulated data sets: 

• The clusters' assigned rate values </>/-. We examine an "easy" setting in which the four 
cluster rate values are 1, 3, 6, 10, a "medium setting" with values 0.01, 0.5, 1.2, 2 and a 
"hard" setting with values 0.1, 0.2, 0.3, 0.6. The rates values are well separated in the 
easy setting and becomes harder to distinguish as we move to the hard setting. 

• The thinning value ai, which directly relates to the autocorrelation values of the in- 
dividual PoINAR(l) processes. We use three different thinning values shared between 
all locations: 0.1,0.5,0.9. 



Figure 10 illustrates examples of the simulated data for the three different rate scenarios 
using a thinning value of 0.3. In the "easy" setting, the tract means fall into four clear 
clusters. In the "hard" setting, it is much more difficult to distinguish between the four 
clusters solely based on the tract means. Furthermore, we can see that as the thinning value 
grows the tract means become larger and consequently it is easier to identify the clusters. 

C.2 Simulation results 

Although we are primarily interested in the out-of-sample performance of our method, 
there are still two important measures that are useful to examine in-sample: (i) How many 
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Figure 10: The top panel shows the rates values for the four different clusters, along with the 
prior distribution for the rates. The lower panel shows an example of 4 overlaid simulated 
time series. Each time series corresponds to a different cluster and is colored accordingly. 

clusters does our method recover? (ii) How close is the recovered clustering assignment to 
the true assignment? By finding accurate clusterings, our method can borrow information 
across the multiple time series yielding more accurate out-of-sample predictions. 



We run our sampler for 1000 iterations for each of the 9 settings. Figure 11 displays 
histograms for the number of inferred clusters for each of the scenarios plotted in Figure [10| 



Figure 11 also shows the Hamming distances between the estimated and true clustering 
assignment labels. The distances are calculated by first choosing the optimal mapping of 
indices maximizing overlap between the true and estimated labels assignment sequences. As 
seen, the modal number of clusters is four in all of the three settings and, as expected, the 
method recovers the true clustering assignment more accurately for the easy setting than for 
the hard one. However, even for the hard setting, the Hamming distance errors are usually 
less than 10% indicating that most time series are correctly clustered. Although we only 
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Figure 11: In-sample simulations results. The top panel displays the histogram for the 
number of clusters over the 1000 iterations. The bottom panel shows the Hamming distance 
errors between the estimated and true cluster assignments versus MCMC iteration. 

display the in-sample analysis for these three settings, these results generally hold for all 9 
settings. 

An interesting question is whether the methodology finds clustering structure when in fact 
all the time series belong to the same cluster. To examine this, we simulate a data set that 



has all of the time series grouped into a single cluster. Figure 12 shows the data and the 
results for the corresponding MCMC sampler. The model predominantly prefers to group 
all of the time series together, as we would hope in such a case. 

In order to evaluate our model's estimation performance, we compare the estimated one- 
step- ahead conditional expectation of the PoINAR(l) versus its corresponding true (simu- 



lated) population value. Brannas (1993) showed that the /i-step-ahead conditional expecta- 
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Figure 12: Simulation results for a case where there is a single cluster. The top panels 
display the single cluster rate value and an example of one of the time series' count data. 
The bottom panels display the histogram of the number of clusters our method finds and 
the corresponding Hamming distance errors versus MCMC iteration. 



tion of the PoINAR(l) model is: 



y T+h = E(y T+h \y 1 ,...,y T ) 



E(a h oy T + ^2a h j o e T+j \y T ) 

h 
3=1 



a h -y T 



(35) 



To use this predictor for our multivariate PoINAR(l) process, we need to produce estimators 



for the rates value, A, the seasonal effects values, S , and the thinning value, a, for the L 
time series. To this end, we run the MCMC sampler for m = 1000 iterations and discard 
the first 100 of them as burn-in. We then thin every 5 th iteration which leaves us with 180 
iterations from which to infer the parameters in our model. The one-step-ahead conditional 
expected value for the m th iteration is: 



Vt- 



|\'M _.M a(m) _ Am) , ,'(m) fl (m) / qfi \ 



For each time series (location), we now have samples from the posterior distribution of the 
conditional expected value which we average to produce the corresponding estimated value. 
We compare the performance of our method with two benchmark methods: the conditional 
least-squares (CLS) method and a simple Poisson process (SPP). Since the CLS method 
models the PoINAR(l) process for each time series separately, we estimate its parameters 



correspondingly. We then plug these estimators into Eq. (35) to produce the corresponding 
one-step-ahead predicted value for each of the time series. For further details on the CLS 
method, the reader is referred to Section |Bj The SPP assumes, for a single time series, the 
observed counts are independent identically distributed Poisson random variables with a 
constant rate value, A. Therefore, we estimate A for each time series using its corresponding 
counts average and then use this as the one-step-ahead predictor. 

To evaluate the different methodologies we use root mean square error (RMSE) and ab- 
solute percentage error (APE) between the true population expected value and its corre- 
sponding estimated value based on the L = 100 time series. The results of this analysis are 
presented in Table [3j The analysis reveals that our method consistently yields more accurate 
results compared to the CLS method and the SPP. As expected, the better the separation 
between the cluster rate values, the easier it is for our method to estimate the parameters 
more accurately. In addition, generally higher autocorrelation values produce lower APE but 
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Thin 
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Easy 


Med 
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Easy 


Med 


Hard 


Easy 


Med 


Hard 


SPP RMSE 


0.477 


0.113 


0.005 


1.674 


0.880 


0.293 


6.128 


1.155 


0.552 


CLS RMSE 


0.306 


0.080 


0.035 


0.284 


0.114 


0.057 


0.343 


0.118 


0.055 


BNP RMSE 


0.219 


0.058 


0.026 


0.260 


0.086 


0.045 


0.299 


0.075 


0.043 


SPP APE 


0.067 


0.159 


0.1241 


0.1958 


0.3192 


0.2013 


0.1230 


0.3372 


0.2737 


CLS APE 


0.041 


0.142 


0.110 


0.024 


0.120 


0.093 


0.006 


0.090 


0.047 


BNP APE 


0.033 


0.041 


0.072 


0.019 


0.033 


0.044 


0.005 


0.046 


0.022 


E(v,t+i) 


5.383 


1.001 


0.317 


9.861 


1.848 


0.591 


52.161 


9.908 


3.0633 



Table 3: Conditional mean estimation comparison between the CLS, SPP and our Bayesian 
nonparametric (BNP) method. The first four rows show the mean square error (MSE) and 
the absolute percentage error (APE) between the population (true) one-step-ahead condi- 
tional mean and its corresponding estimated value. The last row shows the average popula- 
tion (true) conditional expected value. 

higher RMSE. Intuitively because the stationary distribution mean value for the PoINAR(l) 
process is E(j/) = y^-, higher autocorrelation, a, yields a higher marginal mean value (or 
alternatively higher count values). This indicates a larger separation between the clusters 
counts values for the data sets with higher a. Therefore, higher autocorrelation helps our 
method identify the "true" clusters and yield more accurate estimators based on shrinking. 
In conclusion, we believe that because the CLS and SPP methods consider each time series 
separately, they are more prone to over-fitting. The proposed Bayesian methodology allows 
the estimates to pool information from several time series resulting in more robust parameter 
estimates. 



D Washington, D.C. population density map 



Figure [13] shows the map of population density across the 188 Washington, D.C. census 
tracts. 



E Washington D.C. crimes time series 



Figure 14 shows the time series of weekly counts of violent crimes in four tracts between 
2001 and 2008. We again see that some tracts can have very few occurrences whereas others 
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Figure 13: Washington, D.C. map of population density across the 188 census tracts. 
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Overall 


SPP bias 


0.0873 
(0.0303) 


0.092 

(0.0458) 


0.2080 
(0.095) 


0.4730 
(0.1761) 


0.5308 
(0.2434) 


0.126 
(0.0379) 


CLS bias 


-0.1625 
(0.0383) 


-0.0834 
(0.0441) 


0.1310 

(0.0716) 


0.3287 
(0.1276) 


0.3849 
(0.3736) 


-0.073 
(0.0405) 


BNP bias 


-0.0234 

(0.0156) 


0.0690 

(0.0393) 


0.2175 
(0.0710) 


0.2196 

(0.1143) 


0.19262 

(0.3528) 


0.0456 

(0.0232) 


Frequency 


0.5900 


0.2340 


0.1160 


0.0400 


0.0200 


1 



Table 4: One-step-ahead average bias as a function of the last observed value of y.^- 

have as many as 9 violent crimes per week. Also, since the counts are both discrete and 
small, it is hard to see clear seasonality within the weekly series. 

F Bias analysis 

In Table |4| we provide a summary of the average one- week-ahead bias for the Washington, 
D.C. crime data analysis. As we see from the results, in general, our method produces the 
smallest bias, but the differences between the methods are not significant except when the 
last observed count is zero. 
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Figure 14: Weekly violent crime counts between 2001 and 2008 in 4 census tracts. 
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